Basic Linear Algebra Subprograms (BLAS) is a

specification A specification often refers to a set of documented requirements to be satisfied by a material, design, product, or service. A specification is often a type of technical standard. There are different types of technical or engineering specificati ...

that prescribes a set of low-level routines for performing common

linear algebra Linear algebra is the branch of mathematics concerning linear equations such as: :a_1x_1+\cdots +a_nx_n=b, linear maps such as: :(x_1, \ldots, x_n) \mapsto a_1x_1+\cdots +a_nx_n, and their representations in vector spaces and through matrices. ...

operations such as

vector Vector most often refers to: *Euclidean vector, a quantity with a magnitude and a direction *Vector (epidemiology), an agent that carries and transmits an infectious pathogen into another living organism Vector may also refer to: Mathematic ...

addition,

scalar multiplication In mathematics, scalar multiplication is one of the basic operations defining a vector space in linear algebra (or more generally, a module in abstract algebra). In common geometrical contexts, scalar multiplication of a real Euclidean vector by ...

dot product In mathematics, the dot product or scalar productThe term ''scalar product'' means literally "product with a scalar as a result". It is also used sometimes for other symmetric bilinear forms, for example in a pseudo-Euclidean space. is an algebra ...

s, linear combinations, and

matrix multiplication In mathematics, particularly in linear algebra, matrix multiplication is a binary operation that produces a matrix from two matrices. For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the s ...

. They are the ''

de facto ''De facto'' ( ; , "in fact") describes practices that exist in reality, whether or not they are officially recognized by laws or other formal norms. It is commonly used to refer to what happens in practice, in contrast with ''de jure'' ("by la ...

'' standard low-level routines for linear algebra libraries; the routines have bindings for both C ("CBLAS interface") and Fortran ("BLAS interface"). Although the BLAS specification is general, BLAS implementations are often optimized for speed on a particular machine, so using them can bring substantial performance benefits. BLAS implementations will take advantage of special floating point hardware such as vector registers or SIMD instructions. It originated as a Fortran library in 1979* and its interface was standardized by the BLAS Technical (BLAST) Forum, whose latest BLAS report can be found on the netlib website. This Fortran library is known as the ''reference implementation'' (sometimes confusingly referred to as ''the'' BLAS library) and is not optimized for speed but is in the public domain. Most computing libraries that offer linear algebra routines conform to common BLAS user interface command structures, thus queries to those libraries (and the associated results) are often portable between BLAS library branches, such as CUDA#Programming_abilities, cuBLAS (nvidia GPU, GPGPU), ROCm#rocBLAS_/_hipBLAS, rocBLAS (amd GPU, GPGP), and OpenBLAS. This interoperability is then the basis of functioning homogenous code implementations between heterzygous cascades of computing architectures (such as those found in some advanced clustering implementations). Examples of CPU-based BLAS library branches include: OpenBLAS, BLIS (software), BLIS (BLAS-like Library Instantiation Software), Arm Performance Libraries, Automatically Tuned Linear Algebra Software, ATLAS, and Intel Math Kernel Library (iMKL). AMD maintains a fork of BLIS that is optimized for the Advanced Micro Devices, AMD platform, although it is unclear whether integrated ombudsmen resources are present in that particular software-hardware implementation. ATLAS is a portable library that automatically optimizes itself for an arbitrary architecture. iMKL is a freeware and proprietary vendor library optimized for x86 and x86-64 with a performance emphasis on Intel processors. OpenBLAS is an open-source library that is hand-optimized for many of the popular architectures. The LINPACK benchmarks rely heavily on the BLAS routine General Matrix Multiply, gemm for its performance measurements. Many numerical software applications use BLAS-compatible libraries to do linear algebra computations, including LAPACK, LINPACK, Armadillo (C++ library), Armadillo, GNU Octave, Mathematica, MATLAB, NumPy, R (programming language), R, and Julia (programming language), Julia.

Background

With the advent of numerical programming, sophisticated subroutine libraries became useful. These libraries would contain subroutines for common high-level mathematical operations such as root finding, matrix inversion, and solving systems of equations. The language of choice was FORTRAN. The most prominent numerical programming library was IBM's Scientific Subroutine Package (SSP). These subroutine libraries allowed programmers to concentrate on their specific problems and avoid re-implementing well-known algorithms. The library routines would also be better than average implementations; matrix algorithms, for example, might use full pivoting to get better numerical accuracy. The library routines would also have more efficient routines. For example, a library may include a program to solve a matrix that is upper triangular. The libraries would include single-precision and double-precision versions of some algorithms. Initially, these subroutines used hard-coded loops for their low-level operations. For example, if a subroutine needed to perform a matrix multiplication, then the subroutine would have three nested loops. Linear algebra programs have many common low-level operations (the so-called "kernel" operations, not related to kernel (operating system), operating systems). Between 1973 and 1977, several of these kernel operations were identified. These kernel operations became defined subroutines that math libraries could call. The kernel calls had advantages over hard-coded loops: the library routine would be more readable, there were fewer chances for bugs, and the kernel implementation could be optimized for speed. A specification for these kernel operations using scalar (mathematics), scalars and

s, the level-1 Basic Linear Algebra Subroutines (BLAS), was published in 1979. BLAS was used to implement the linear algebra subroutine library LINPACK. The BLAS abstraction allows customization for high performance. For example, LINPACK is a general purpose library that can be used on many different machines without modification. LINPACK could use a generic version of BLAS. To gain performance, different machines might use tailored versions of BLAS. As computer architectures became more sophisticated, vector processor, vector machines appeared. BLAS for a vector machine could use the machine's fast vector operations. (While vector processors eventually fell out of favor, vector instructions in modern CPUs are essential for optimal performance in BLAS routines.) Other machine features became available and could also be exploited. Consequently, BLAS was augmented from 1984 to 1986 with level-2 kernel operations that concerned vector-matrix operations. Memory hierarchy was also recognized as something to exploit. Many computers have cache memory that is much faster than main memory; keeping matrix manipulations localized allows better usage of the cache. In 1987 and 1988, the level 3 BLAS were identified to do matrix-matrix operations. The level 3 BLAS encouraged block-partitioned algorithms. The LAPACK library uses level 3 BLAS. The original BLAS concerned only densely stored vectors and matrices. Further extensions to BLAS, such as for sparse matrices, have been addressed.

Functionality

BLAS functionality is categorized into three sets of routines called "levels", which correspond to both the chronological order of definition and publication, as well as the degree of the polynomial in the complexities of algorithms; Level 1 BLAS operations typically take linear time, , Level 2 operations quadratic time and Level 3 operations cubic time. Modern BLAS implementations typically provide all three levels.

Level 1

This level consists of all the routines described in the original presentation of BLAS (1979), which defined only ''vector operations'' on Stride of an array, strided arrays:

s, norm (mathematics), vector norms, a generalized vector addition of the form :

\boldsymbol \leftarrow \alpha \boldsymbol + \boldsymbol

(called "axpy", "a x plus y") and several other operations.

Level 2

This level contains ''matrix-vector operations'' including, among other things, a ''ge''neralized ''m''atrix-''v''ector multiplication (gemv): :

\boldsymbol \leftarrow \alpha \boldsymbol \boldsymbol + \beta \boldsymbol

as well as a solver for in the linear equation :

\boldsymbol \boldsymbol = \boldsymbol

with being triangular. Design of the Level 2 BLAS started in 1984, with results published in 1988. The Level 2 subroutines are especially intended to improve performance of programs using BLAS on vector processors, where Level 1 BLAS are suboptimal "because they hide the matrix-vector nature of the operations from the compiler."

Level 3

This level, formally published in 1990, contains ''matrix-matrix operations'', including a "general

" (gemm), of the form :

\boldsymbol \leftarrow \alpha \boldsymbol \boldsymbol + \beta \boldsymbol,

where and can optionally be transposed or hermitian conjugate, hermitian-conjugated inside the routine, and all three matrices may be strided. The ordinary matrix multiplication can be performed by setting to one and to an all-zeros matrix of the appropriate size. Also included in Level 3 are routines for computing :

\boldsymbol \leftarrow \alpha \boldsymbol^ \boldsymbol,

where is a triangular matrix, among other functionality. Due to the ubiquity of matrix multiplications in many scientific applications, including for the implementation of the rest of Level 3 BLAS, and because faster algorithms exist beyond the obvious repetition of matrix-vector multiplication, gemm is a prime target of optimization for BLAS implementers. E.g., by decomposing one or both of , into Block matrix, block matrices, gemm can be Matrix multiplication algorithm#Divide-and-conquer algorithm, implemented recursively. This is one of the motivations for including the parameter, so the results of previous blocks can be accumulated. Note that this decomposition requires the special case which many implementations optimize for, thereby eliminating one multiplication for each value of . This decomposition allows for better locality of reference both in space and time of the data used in the product. This, in turn, takes advantage of the CPU cache, cache on the system. For systems with more than one level of cache, the blocking can be applied a second time to the order in which the blocks are used in the computation. Both of these levels of optimization are used in implementations such as Automatically Tuned Linear Algebra Software, ATLAS. More recently, implementations by Kazushige Goto have shown that blocking only for the L2 cache, combined with careful amortized analysis, amortizing of copying to contiguous memory to reduce translation lookaside buffer, TLB misses, is superior to Automatically Tuned Linear Algebra Software, ATLAS. A highly tuned implementation based on these ideas is part of the GotoBLAS, OpenBLAS and BLIS (software), BLIS. A common variation of is the , which calculates a complex product using "three real matrix multiplications and five real matrix additions instead of the conventional four real matrix multiplications and two real matrix additions", an algorithm similar to Strassen algorithm first described by Peter Ungar.

Implementations

; Accelerate: Apple Inc., Apple's framework for macOS and IOS (Apple), iOS, which includes tuned versions of BLAS and LAPACK. ; Arm Performance Libraries: Arm Performance Libraries, supporting Arm 64-bit AArch64-based processors, available from Arm Ltd., Arm. ; ATLAS: Automatically Tuned Linear Algebra Software, an Open-source software, open source implementation of BLAS application programming interface, APIs for C and Fortran, Fortran 77. ; BLIS (software), BLIS: BLAS-like Library Instantiation Software framework for rapid instantiation. Optimized for most modern CPUs. BLIS is a complete refactoring of the GotoBLAS that reduces the amount of code that must be written for a given platform. ; C++ AMP BLAS: The C++ AMP BLAS Library is an Open-source software, open source implementation of BLAS for Microsoft's AMP language extension for Visual C++. ; cuBLAS: Optimized BLAS for NVIDIA based GPU cards, requiring few additional library calls. ; NVBLAS: Optimized BLAS for NVIDIA based GPU cards, providing only Level 3 functions, but as direct drop-in replacement for other BLAS libraries. ; clBLAS: An OpenCL implementation of BLAS by AMD. Part of the AMD Compute Libraries. ; clBLAST: A tuned OpenCL implementation of most of the BLAS api. ; Eigen BLAS: A Fortran, Fortran 77 and C BLAS library implemented on top of the Mozilla license, MPL-licensed Eigen (C++ library), Eigen library, supporting x86, x86-64, ARM architecture, ARM (NEON), and PowerPC architectures. ; ESSL: IBM's Engineering and Scientific Subroutine Library, supporting the PowerPC architecture under AIX operating system, AIX and Linux. ; GotoBLAS: Kazushige Goto's BSD-licensed implementation of BLAS, tuned in particular for Intel Nehalem (microarchitecture), Nehalem/Intel Atom, Atom, VIA Technologies, VIA VIA Nano, Nanoprocessor, AMD Opteron. ;GNU Scientific Library: Multi-platform implementation of many numerical routines. Contains a CBLAS interface. ; HP MLIB: Hewlett-Packard, HP's Math library supporting IA-64, PA-RISC, x86 and Opteron architecture under HP-UX and Linux. ; Intel MKL: The Intel Math Kernel Library, supporting x86 32-bits and 64-bits, available free from Intel. Includes optimizations for Intel Pentium (brand), Pentium, Intel Core, Core and Intel Xeon CPUs and Intel Xeon Phi; support for Linux, Microsoft Windows, Windows and macOS. ; MathKeisan: NEC Corporation, NEC's math library, supporting NEC SX architecture under SUPER-UX, and Itanium under Linux ; Netlib BLAS: The official reference implementation on Netlib, written in Fortran, Fortran 77. ; Netlib CBLAS: Reference C interface to the BLAS. It is also possible (and popular) to call the Fortran BLAS from C. ; OpenBLAS: Optimized BLAS based on GotoBLAS, supporting x86, x86-64, MIPS architecture, MIPS and ARM architecture, ARM processors. ; PDLIB/SX: NEC Corporation, NEC's Public Domain Mathematical Library for the NEC NEC SX architecture, SX-4 system. ; rocBLAS: Implementation that runs on AMD GPUs via ROCm. ;SCSL : Silicon Graphics, SGI's Scientific Computing Software Library contains BLAS and LAPACK implementations for SGI's Irix workstations. ; Sun Performance Library: Optimized BLAS and LAPACK for SPARC, Intel Core, Core and AMD64 architectures under Solaris 8, 9, and 10 as well as Linux. ; uBLAS: A generic C++ template class library providing BLAS functionality. Part of the Boost library. It provides bindings to many hardware-accelerated libraries in a unifying notation. Moreover, uBLAS focuses on correctness of the algorithms using advanced C++ features.

Libraries using BLAS

; Armadillo: Armadillo (C++ library), Armadillo is a C++ linear algebra library aiming towards a good balance between speed and ease of use. It employs template classes, and has optional links to BLAS/ATLAS and LAPACK. It is sponsored by NICTA (in Australia) and is licensed under a free license. ; LAPACK: LAPACK is a higher level Linear Algebra library built upon BLAS. Like BLAS, a reference implementation exists, but many alternatives like libFlame and MKL exist. ; Mir: An LLVM-accelerated generic numerical library for science and machine learning written in D (programming language), D. It provides generic linear algebra subprograms (GLAS). It can be built on a CBLAS implementation.

Similar libraries (not compatible with BLAS)

; Elemental: Elemental is an open source software for distributed memory, distributed-memory dense and sparse-direct linear algebra and optimization. ; HASEM: is a C++ template library, being able to solve linear equations and to compute eigenvalues. It is licensed under BSD License. ; LAMA: The Library for Accelerated Math Applications (Library for Accelerated Math Applications, LAMA) is a C++ template library for writing numerical solvers targeting various kinds of hardware (e.g. GPUs through CUDA or OpenCL) on distributed memory systems, hiding the hardware specific programming from the program developer ; MTL4: The Matrix Template Library version 4 is a generic C++ template library providing sparse and dense BLAS functionality. MTL4 establishes an intuitive interface (similar to MATLAB) and broad applicability thanks to generic programming.

Sparse BLAS

Several extensions to BLAS for handling Sparse matrix, sparse matrices have been suggested over the course of the library's history; a small set of sparse matrix kernel routines was finally standardized in 2002.

Batched BLAS

The traditional BLAS functions have been also ported to architectures that support large amounts of parallelism such as GPUs. Here, the traditional BLAS functions provide typically good performance for large matrices. However, when computing e.g., matrix-matrix-products of many small matrices by using the GEMM routine, those architectures show significant performance losses. To address this issue, in 2017 a batched version of the BLAS function has been specified. Taking the GEMM routine from above as an example, the batched version performs the following computation simultaneously for many matrices:

\boldsymbol[k] \leftarrow \alpha \boldsymbol[k] \boldsymbol[k] + \beta \boldsymbol[k] \quad  \forall k

The index

k

in square brackets indicates that the operation is performed for all matrices

k

in a stack. Often, this operation is implemented for a strided batched memory layout where all matrices follow concatenated in the arrays

A

B

and

C

. Batched BLAS functions can be a versatile tool and allow e.g. a fast implementation of exponential integrators and Magnus integrators that handle long integration periods with many time steps. Here, the matrix exponentiation, the computationally expensive part of the integration, can be implemented in parallel for all time-steps by using Batched BLAS functions.

References

External links

BLAS homepage
on Netlib.org

from LAPACK Users' Guide

One of the original authors of the BLAS discusses its creation in an oral history interview. Charles L. Lawson Oral history interview by Thomas Haigh, 6 and 7 November 2004, San Clemente, California. Society for Industrial and Applied Mathematics, Philadelphia, PA.

In an oral history interview, Jack Dongarra explores the early relationship of BLAS to LINPACK, the creation of higher level BLAS versions for new architectures, and his later work on the ATLAS system to automatically optimize BLAS for particular machines. Jack Dongarra, Oral history interview by Thomas Haigh, 26 April 2005, University of Tennessee, Knoxville TN. Society for Industrial and Applied Mathematics, Philadelphia, PA
How does BLAS get such extreme performance?
Ten naive 1000×1000 matrix multiplications (10¹⁰ floating point multiply-adds) takes 15.77 seconds on 2.6 GHz processor; BLAS implementation takes 1.32 seconds. * An Overview of the Sparse Basic Linear Algebra Subprograms: The New Standard from the BLAS Technical Forum {{Linear algebra Numerical linear algebra Numerical software Public-domain software with source code